This document will guide you through a few data analysis and model fitting tasks.
Below, I provide commentary and instructions, and you are expected to write all or some of the missing code to perform the steps I describe.
Note that I call the main data variable d. So if you see bits of code with that variable, it is the name of the data. You are welcome to give it different names, then just adjust the code snippets accordingly.
We need a variety of different packages, which are loaded here. Install as needed. If you use others, load them here.
knitr::opts_chunk$set(error = TRUE, warning=FALSE)library('tidyr')
library('dplyr')##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('forcats')
library('ggplot2')
library('corrplot') #to make a correlation plot. You can use other options/packages.## corrplot 0.84 loaded
library('visdat') #for missing data visualization
library('caret') #for model fitting## Loading required package: lattice
## Error: package or namespace load failed for 'caret' in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
## namespace 'rlang' 0.3.4 is already loaded, but >= 0.4.0 is required
library('earth')## Error in library("earth"): there is no package called 'earth'
We will be exploring and fitting a dataset of norovirus outbreaks. You can look at the codebook, which briefly explains the meaning of each variable. If you are curious, you can check some previous papers that we published using (slighly different versions of) this dataset here and here.
raw_noro_data <- read.csv("./norodata.csv")## Error in file(file, "rt"): cannot open the connection
glimpse(raw_noro_data)## Error in glimpse(raw_noro_data): object 'raw_noro_data' not found
#Write code that loads the dataset and does a quick check to make sure the data loaded ok (using e.g. `str` and `summary` or similar such functions).
#data_raw <- (FILL HERE)Let’s assume that our main outcome of interest is the fraction of individuals that become infected in a given outbreak. The data reports that outcome (called RateAll), but we’ll also compute it ourselves so that we can practice creating new variables. To do so, take a look at the data (maybe peek at the Codebook) and decide which of the existing variables you should use to compute the new one. This new outcome variable will be added to the data frame.
Important Variables: Cases1 = Total number of suspected primary cases Cases2 = Total number of suspected secondary CasesAll Total number of suspected cases, including both primary and secondary cases RateAll = (secondary cases + primary cases/persons at risk)100 Rate1 = (primary cases/persons at risk)100 Rate2 = (secondary cases/persons at risk)*100 RiskAll = Total number of persons at risk of being a primary or secondary case
noro_add_new_var <- raw_noro_data %>% dplyr::mutate(fracinf = (CasesAll/RiskAll)*100)## Error in eval(lhs, parent, parent): object 'raw_noro_data' not found
head(noro_add_new_var$RateAll)## Error in head(noro_add_new_var$RateAll): object 'noro_add_new_var' not found
head(noro_add_new_var$fracinf)## Error in head(noro_add_new_var$fracinf): object 'noro_add_new_var' not found
# Use the `mutate()` function from the `dplyr` package to create a new column with this value. Call the new variable `fracinf`.
#d <- data_raw %>% dplyr::mutate(FILL HERE):::note Note the notation dplyr:: in front of mutate. This is not strictly necessary, but it helps in 2 ways. First, this tells the reader explicitly from which package the function comes. This is useful for quickly looking at the help file of the function, or if we want to adjust which packages are loaded/used. It also avoids occasional confusion if a function exists more than once (e.g. filter exists both in the stats and dplyr package). If the package is not specified, R takes the function from the package that was loaded last. This can sometimes produce strange error messages. I thus often (but not always) write the package name in front of the function. :::
:::fyi As you see in the Rmd file, the previous text box is created by placing texts between the ::: symbols and specifying some name. This allows you to apply your own styling to specific parts of the text. You define your style in a css file (here called customstyles.css), and you need to list that file in the _site.yml file. The latter file also lets you change the overall theme. You can choose from the library of free Bootswatch themes. :::
Use both text summaries and plots to take a look at the new variable you created to see if everything looks ok or if we need further cleaning.
glimpse(noro_add_new_var$fracinf)## Error in glimpse(noro_add_new_var$fracinf): object 'noro_add_new_var' not found
summary(noro_add_new_var$fracinf)## Error in summary(noro_add_new_var$fracinf): object 'noro_add_new_var' not found
str(noro_add_new_var$fracinf)## Error in str(noro_add_new_var$fracinf): object 'noro_add_new_var' not found
head(noro_add_new_var$fracinf)## Error in head(noro_add_new_var$fracinf): object 'noro_add_new_var' not found
noro_add_new_var %>% ggplot(aes(x = fracinf)) + geom_density()## Error in eval(lhs, parent, parent): object 'noro_add_new_var' not found
#Write code that takes a look at the values of the `fracinf` variable you created. Look at both text summaries and a figure.We notice there are NAs in this variable and the distribution is not normal. The latter is somewhat expected since our variable is a proportion, so it has to be between 0 and 1. There are also a lot of infinite values. Understand where they come from.
Let’s take a look at the RateAll variable recorded in the dataset and compare it to ours. First, create a plot that lets you quickly see if/how the variables differ.
noro_add_new_var %>% ggplot(aes(x = RateAll)) + geom_density()## Error in eval(lhs, parent, parent): object 'noro_add_new_var' not found
noro_add_new_var %>% ggplot(aes(x = fracinf)) + geom_density()## Error in eval(lhs, parent, parent): object 'noro_add_new_var' not found
noro_add_new_var %>% ggplot(aes(x = fracinf, y = RateAll)) + geom_point()## Error in eval(lhs, parent, parent): object 'noro_add_new_var' not found
noro_add_new_var %>% ggplot(aes(x = RateAll - fracinf, y = fracinf)) + geom_point()## Error in eval(lhs, parent, parent): object 'noro_add_new_var' not found
# Plot one variable on the x axis, the other on the y axis
# also plot the difference of the 2 variables
# make sure you adjust so both are in the same unitsBoth ways of plotting the data show that for most outbreaks, the two ways of getting the outcome agree. So that’s good. But we need to look closer and resolve the problem with infinite values above. Check to see what the RateAll variable has for those infinite values.
infinity_subset <- subset(noro_add_new_var, fracinf == "Inf")## Error in subset(noro_add_new_var, fracinf == "Inf"): object 'noro_add_new_var' not found
print(infinity_subset$RateAll)## Error in print(infinity_subset$RateAll): object 'infinity_subset' not found
#Write code that looks at the values of RateAll where we have inifinite valuesYou should find that all of the reported values are 0. So what makes more sense? You should have figured out that the infinite values in our computed variables arise because the RiskAll variable is 0. That variable contains the total number of persons at risk for an outbreak. If nobody is at risk of getting infected, of course, we can’t get any infected. So RateAll being 0 is technically correct. But does it make sense to include “outbreaks” in our analysis where nobody is at risk of getting infected? One should question how those got into the spreadsheet in the first place.
Having to deal with “weirdness” in your data like this example is common. You often need to make a decision based on best judgment.
Here, I think that if nobody is at risk, we shouldn’t include those outbreaks in further analysis. Thus, we’ll go with our computed outcome and remove all observations that have missing or infinite values for the outcome of interest, since those can’t be used for model fitting. Thus, we go ahead and remove any observations that have un-useable values in the outcome.
filter_Inf <- noro_add_new_var %>% dplyr::filter(!fracinf %in% c("Inf")) ## Error in eval(lhs, parent, parent): object 'noro_add_new_var' not found
filter_NA <- filter_Inf[!is.na(filter_Inf$fracinf), ]## Error in eval(expr, envir, enclos): object 'filter_Inf' not found
min(filter_NA$fracinf) #Check for zeros## Error in eval(expr, envir, enclos): object 'filter_NA' not found
max(filter_NA$fracinf) #Check for NAs## Error in eval(expr, envir, enclos): object 'filter_NA' not found
glimpse(filter_NA) #View total observations## Error in glimpse(filter_NA): object 'filter_NA' not found
unique(filter_NA$fracinf) #Final check of all unique entries to look for weird data## Error in unique(filter_NA$fracinf): object 'filter_NA' not found
#Write code that removes all observations that have an outcome that is not very useful, i.e. either NA or infinity. Then look at the outcome variable again to make sure things are fixed. Also check the size of the new dataset to see by how much it shrunk.You should find that we lost a lot of data, we are down to 579 observations (from a starting 1022). That would be troublesome for most studies if that would mean subjects drop out (that could lead to bias). Here it’s maybe less problematic since each observation is an outbreak collected from the literature. Still, dropping this many could lead to bias if all the ones that had NA or Infinity were somehow systematically different. It would be useful to look into and discuss in a real analysis.
Not uncommon for real datasets, this one has a lot of variables. Many are not too meaningful for modeling. Our question is what predicts the fraction of those that get infected, i.e., the new outcome we just created. We should first narrow down the predictor variables of interest based on scientific grounds.
For this analysis exercise, we just pick the following variables for further analysis: Action1, CasesAll, Category, Country, Deaths, GG2C4, Hemisphere, Hospitalizations, MeanA1, MeanD1, MeanI1, MedianA1, MedianD1, MedianI1, OBYear, Path1, RiskAll, Season, Setting, Trans1, Vomit. Of course, we also need to keep our outcome of interest.
Note that - as often happens for real data - there are inconsistencies between the codebook and the actual datasheet. Here, names of variables and spelling in the codebook do not fully agree with the data. The above list of variables is based on codebook, and you need to make sure you get the right names from the data when selecting those variables.
noro_data_reduce_vars <- filter_NA[, c("Action1", "CasesAll", "Category", "Country", "Deaths", "gg2c4", "Hemisphere", "Hospitalizations", "MeanA1", "MeanD1", "MeanI1", "MedianA1", "MedianI1", "OBYear", "Path1", "RiskAll", "season", "Setting_1", "Setting_2", "Trans1", "Vomit", "fracinf")]## Error in eval(expr, envir, enclos): object 'filter_NA' not found
glimpse(noro_data_reduce_vars)## Error in glimpse(noro_data_reduce_vars): object 'noro_data_reduce_vars' not found
#write code to select the specified variablesYour reduced dataset should contain 579 observations and 22 variables.
With this reduced dataset, we’ll likely still need to perform further cleaning. We can start by looking at missing data. While the summary function gives that information, it is somewhat tedious to pull out. We can just focus on NA for each variable and look at the text output, or for lots of predictors, a graphical view is easier to understand. The latter has the advantage of showing potential clustering of missing values.
print(colSums(is.na(noro_data_reduce_vars)))## Error in is.data.frame(x): object 'noro_data_reduce_vars' not found
# this code prints number of missing for each variable (assuming your dataframe is called d)
# print(colSums(is.na(d)))
vis_dat(noro_data_reduce_vars) #All data-grey is missing. Shows spread of other classes in observations as well.## Error in test_if_dataframe(x): object 'noro_data_reduce_vars' not found
vis_miss(noro_data_reduce_vars) #All data-seperates by present and absent, also gives percentages.## Error in test_if_dataframe(x): object 'noro_data_reduce_vars' not found
# write code to use the visdat R package, add code that plots a heatmap of missing values It looks like we have a lot of missing data for the MeanA1 and MedianA1 variables. If we wanted to keep those variables, we would be left with very few observations. So let’s drop those two variables. After that, we will drop all observations that have missing data (seems to be Hospitalization and Deaths).
drop_A1vars <- dplyr::select(noro_data_reduce_vars, -MeanA1, -MedianA1)## Error in dplyr::select(noro_data_reduce_vars, -MeanA1, -MedianA1): object 'noro_data_reduce_vars' not found
drop_A1vars <- na.omit(drop_A1vars)## Error in na.omit(drop_A1vars): object 'drop_A1vars' not found
# write code to remove the 2 "A1" variables, then drop all remaining observations with NALet’s now check the format of each variable. Depending on how you loaded the data, some variables might not be in the right format. Make sure everything that should be numeric is numeric/integer, everything that should be a factor is a factor. There should be no variable coded as character. Once all variables have the right format, take a look at the data again.
glimpse(drop_A1vars) #Should OBYear be a factor or integer. Will convert to integer.## Error in glimpse(drop_A1vars): object 'drop_A1vars' not found
drop_A1vars$OBYear <- as.integer(as.factor(drop_A1vars$OBYear))## Error in is.factor(x): object 'drop_A1vars' not found
# write code to format variables as neededTake another look at the data. You should find that for the dataset, most things look reasonable, but the variable Setting_1 has a lot of different levels/values. That many categories, most with only a single entry, will likely not be meaningful for modeling. One option is to drop the variable. But assume we think it’s an important variable to include and we are especially interested in the difference between restaurant settings and other settings. We could then create a new variable that has only two levels, Restaurant and Other.
unique(drop_A1vars$Setting_1)## Error in unique(drop_A1vars$Setting_1): object 'drop_A1vars' not found
modify_Restaurant <- drop_A1vars %>% mutate(Setting_1 = recode(Setting_1,
"restaurant" = "Restaurant",
"take-out restaurant" = "Restaurant",
"Luncheon and Restaruant" = "Restaurant",
"Catering service in Restaurant" = "Restaurant",
"restaurant in Northern Territory" = "Restaurant",
"Shared meal at a restaurant" = "Restaurant",
"Resaurant" = "Restaurant",
"restaurant; catered party" = "Restaurant"))## Error in eval(lhs, parent, parent): object 'drop_A1vars' not found
find_Restaurant <- modify_Restaurant[grep("Restaurant", modify_Restaurant$Setting_1), ]## Error in eval(expr, envir, enclos): object 'modify_Restaurant' not found
add_new_Setting_Rest <- find_Restaurant %>% dplyr::mutate(Setting = "Restaurant")## Error in eval(lhs, parent, parent): object 'find_Restaurant' not found
find_not_Restaurant <- dplyr::filter(modify_Restaurant, !grepl("Restaurant", Setting_1))## Error in dplyr::filter(modify_Restaurant, !grepl("Restaurant", Setting_1)): object 'modify_Restaurant' not found
add_new_Setting_not_Rest <- find_not_Restaurant %>% dplyr::mutate(Setting = "Other")## Error in eval(lhs, parent, parent): object 'find_not_Restaurant' not found
combined_data <- merge(add_new_Setting_not_Rest, add_new_Setting_Rest, all = TRUE)## Error in merge(add_new_Setting_not_Rest, add_new_Setting_Rest, all = TRUE): object 'add_new_Setting_not_Rest' not found
combined_data$Setting_1 <- NULL## Error in combined_data$Setting_1 <- NULL: object 'combined_data' not found
complete_combined_data <- combined_data## Error in eval(expr, envir, enclos): object 'combined_data' not found
#write code that creates a new variable called `Setting` based on `Setting_1` buth with only 2 levels, `Restaurant` and `Other`. Then remove the `Setting_1` variable. Note that restaurant is sometimes capitalized and sometimes not. You need to fix that first. For these lines of code, the 'Factor' chapter in R4DS might be helpful here.Next, let’s create a few plots showing the outcome and the predictors.
library(gridExtra)##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
cases_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = CasesAll, y = fracinf))## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
deaths_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = Deaths, y = fracinf))## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
hos_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = Hospitalizations, y = fracinf))## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
md1_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = MeanD1, y = fracinf))## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
mi1_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = MeanI1, y = fracinf))## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
medi1_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = MedianI1, y = fracinf)) ## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
risk_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = RiskAll, y = fracinf)) ## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
vomit_frac <- complete_combined_data %>% ggplot() + geom_point(aes(x = Vomit, y = fracinf))## Error in eval(lhs, parent, parent): object 'complete_combined_data' not found
grid.arrange(cases_frac, deaths_frac, hos_frac, md1_frac, mi1_frac, medi1_frac, risk_frac, vomit_frac, nrow = 3)## Error in arrangeGrob(...): object 'cases_frac' not found
#write code that produces plots showing our outcome of interest on the y-axis and each numeric predictor on the x-axis.
#you can use the facet_wrap functionality in ggplot for it, or do it some other way.One thing I notice in the plots is that there are lots of zeros for many predictors and things look skewed. That’s ok, but means we should probably standardize these predictors. One strange finding (that I could have caught further up when printing the numeric summaries, but didn’t) is that there is (at least) one outbreak that has outbreak year reported as 0. That is, of course, wrong and needs to be fixed. There are different ways of fixing it, the best, of course, would be to trace it back and try to fix it with the right value. We won’t do that here. Instead, we’ll remove that observation.
complete_combined_data$OBYear == 0## Error in eval(expr, envir, enclos): object 'complete_combined_data' not found
minus_zero_year <- complete_combined_data[-c(457), ]## Error in eval(expr, envir, enclos): object 'complete_combined_data' not found
minus_zero_year$OBYear == 0## Error in eval(expr, envir, enclos): object 'minus_zero_year' not found
# write code that figures out which observation(s) have 0 years and remove those from the dataset.
# do some quick check to make sure OByear values are all reasonable nowAnother useful check is to see if there are strong correlations between some of the numeric predictors. That might indicate collinearity, and some models can’t handle that very well. In such cases, one might want to remove a predictor. We’ll create a correlation plot of the numeric variables to inspect this.
corr_subset <- minus_zero_year[c(2, 5, 8, 9 , 10, 11, 14, 18)]## Error in eval(expr, envir, enclos): object 'minus_zero_year' not found
corr_matrix <- cor(corr_subset)## Error in is.data.frame(x): object 'corr_subset' not found
corrplot(corr_matrix, method = "square")## Error in corrplot(corr_matrix, method = "square"): object 'corr_matrix' not found
# using e.g. the corrplot package (or any other you like), create a correlation plot of the numeric variablesIt doesn’t look like there are any very strong correlations between continuous variables, so we can keep them all for now.
Next, let’s create plots for the categorical variables, again our main outcome of interest on the y-axis.
Action1_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Action1, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Cat_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Category, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Coun_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Country, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
gg2c4_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = gg2c4, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Hemi_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Hemisphere, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
OBY_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = OBYear, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Path1_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Path1, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Season_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = season, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Setting2_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Setting_2, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Trans1_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Trans1, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
Setting_frac <- minus_zero_year %>% ggplot() + geom_bar(aes(x = Setting, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'minus_zero_year' not found
grid.arrange(Action1_frac, Cat_frac, Coun_frac, gg2c4_frac, Hemi_frac, OBY_frac, Path1_frac, Season_frac, Setting2_frac, Trans1_frac, Setting_frac, nrow = 3)## Error in arrangeGrob(...): object 'Action1_frac' not found
#write code that produces plots showing our outcome of interest on the y-axis and each categorical predictor on the x-axis.
#you can use the facet_wrap functionality in ggplot for it, or do it some other way.The plots do not look pretty, which is ok for exploratory. We can see that a few variables have categories with very few values (again, something we could have also seen using summary, but graphically it is usually easier to see). This will likely produce problems when we fit using cross-validation, so we should fix that. Options we have:
Setting variable.Let’s use a mix of these approaches. We’ll drop the Category variable, we’ll remove the observation(s) with Unspecified in the Hemisphere variable, and we’ll combine Unknown with Unspecified for Action1 and Path1 variables.
# write code that implements the cleaning steps described above.
drop_cat <- dplyr::select(minus_zero_year, -Category)## Error in dplyr::select(minus_zero_year, -Category): object 'minus_zero_year' not found
modify_hemi <- drop_cat[!drop_cat$Hemisphere == "Unspecified", ]## Error in eval(expr, envir, enclos): object 'drop_cat' not found
recode_vars <- modify_hemi %>% dplyr::mutate(Action1 = recode(Action1, "Unknown" = "Unspecified")) %>% dplyr::mutate(Path1 = recode(Path1, "Unknown" = "Unspecified"))## Error in eval(lhs, parent, parent): object 'modify_hemi' not found
# then check again (e.g. with a plot) to make sure things worked
check_hemi <- recode_vars %>% ggplot() + geom_bar(aes(x = Hemisphere, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'recode_vars' not found
check_action <- recode_vars %>% ggplot() + geom_bar(aes(x = Action1, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'recode_vars' not found
check_path <- recode_vars %>% ggplot() + geom_bar(aes(x = Path1, y = fracinf), stat = "identity")## Error in eval(lhs, parent, parent): object 'recode_vars' not found
grid.arrange(check_hemi, check_action, check_path, nrow = 1)## Error in arrangeGrob(...): object 'check_hemi' not found
recode_vars$Setting <- as.factor(as.character(recode_vars$Setting))## Error in is.factor(x): object 'recode_vars' not found
At this step, you should have a dataframe containing 551 observations, and 19 variables: 1 outcome, 9 numeric/integer predictors, and 9 factor variables. There should be no missing values.
recode_vars$Setting <- as.factor(as.character(recode_vars$Setting))## Error in is.factor(x): object 'recode_vars' not found
model_data <- recode_vars## Error in eval(expr, envir, enclos): object 'recode_vars' not found
glimpse(model_data)## Error in glimpse(model_data): object 'model_data' not found
We can finally embark on some modeling - or at least we can get ready to do so.
We will use a lot of the caret package functionality for the following tasks. You might find the package website useful as you try to figure things out.
Depending on the data and question, we might want to reserve some of the data for a final validation/testing step or not. Here, to illustrate this process and the idea of reserving some data for the very end, we’ll split things into a train and test set. All the modeling will be done with the train set, and final evaluation of the model(s) happens on the test set. We use the caret package for this.
#this code does the data splitting. I still assume that your data is stored in the `d` object.
set.seed(123)
trainset <- caret::createDataPartition(y = model_data$fracinf, p = 0.7, list = FALSE)## Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace 'rlang' 0.3.4 is already loaded, but >= 0.4.0 is required
data_train = model_data[trainset,] #extract observations/rows for training, assign to new variable## Error in eval(expr, envir, enclos): object 'model_data' not found
data_test = model_data[-trainset,] #do the same for the test set## Error in eval(expr, envir, enclos): object 'model_data' not found
:::note Since the above code involves drawing samples, and we want to do that reproducible, we also set a random number seed with set.seed(). With that, each time we perform this sampling, it will be the same, unless we change the seed. If nothing about the code changes, setting the seed once at the beginning is enough. If you want to be extra sure, it is a good idea to set the seed at the beginning of every code chunk that involves random numbers (i.e., sampling or some other stochastic/random procedure). We do that here. :::
Now let’s begin with the model fitting. We’ll start by looking at a null model, which is just the mean of the data. This is, of course, a stupid “model” but provides some baseline for performance.
mean(data_train$fracinf) #Seems a bit too easy, I am unsure about this...## Error in mean(data_train$fracinf): object 'data_train' not found
#write code that computes the RMSE for a null model, which is just the mean of the outcome
#remember that from now on until the end, everything happens with the training dataNow we’ll fit the outcome to each predictor one at a time. To evaluate our model performance, we will use cross-validation and the caret package. Note that we just fit a linear model. caret itself is not a model. Instead, it provides an interface that allows easy access to many different models and has functions to do a lot of steps quickly - as you will see below. Most of the time, you can do all our work through the caret (or mlr) workflow. The problem is that because caret calls another package/function, sometimes things are not as clear, especially when you get an error message. So occasionally, if you know you want to use a specific model and want more control over things, you might want to not use caret and instead go straight to the model function (e.g. lm or glm or…). We’ve done a bit of that before, for the remainder of the class we’ll mostly access underlying functions through caret.
data_train <- data_train[, c(18,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19)] #Reorder cols bring fracinf to front. Give it a seperate chunk or else every rerun of the code will continue to change the order of variables.## Error in eval(expr, envir, enclos): object 'data_train' not found
#There is probably a nicer tidyverse way of doing this. I just couldn't think of it, so did it this way.
set.seed(1111) #makes each code block reproducible
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5) #setting CV method for caret## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
Npred <- ncol(data_train)-1 # number of predictors## Error in ncol(data_train): object 'data_train' not found
resultmat_single <- data.frame(Variable = names(data_train)[-1], RMSE = rep(0,Npred)) #store values for RMSE for each variable## Error in data.frame(Variable = names(data_train)[-1], RMSE = rep(0, Npred)): object 'data_train' not found
for (n in 2:ncol(data_train)) #loop over each predictor. For this to work, outcome must be in 1st column
{
fit0 <- train( as.formula(paste("fracinf ~",names(data_train)[n])) , data = data_train, method = "lm", trControl = fitControl)
resultmat_single[n-1,2]= fit0$results$RMSE
}## Error in ncol(data_train): object 'data_train' not found
print(resultmat_single)## Error in print(resultmat_single): object 'resultmat_single' not found
This analysis shows 2 things that might need closer inspections. We get some error/warning messages, and most RMSE of the single-predictor models are not better than the null model. Usually, this is cause for more careful checking until you fully understand what is going on. But for this exercise, let’s blindly press on!
Now let’s perform fitting with multiple predictors. Use the same setup as the code above to fit the outcome to all predictors at the same time. Do that for 3 different models: linear (lm), regression splines (earth), K nearest neighbor (knn). You might have to install/load some extra R packages for that. If that’s the case, caret will tell you.
set.seed(1111) #lm fit
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
fit1 <- train(fracinf~ ., data = data_train, method = "lm", trControl = fitControl) ## Error in train(fracinf ~ ., data = data_train, method = "lm", trControl = fitControl): could not find function "train"
print(fit1)## Error in print(fit1): object 'fit1' not found
#report the RMSE for each method. Note that knn and earth perform some model tuning (we'll discuss this soon) and report multiple RMSE. Use the lowest value.set.seed(1111) #earth fit
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5) ## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
fit2 <- train(fracinf~ ., data = data_train, method = "earth", trControl = fitControl) ## Error in train(fracinf ~ ., data = data_train, method = "earth", trControl = fitControl): could not find function "train"
print(fit2)## Error in print(fit2): object 'fit2' not found
set.seed(1111)
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
fit3 <- train(fracinf~ ., data = data_train, method = "knn", trControl = fitControl) ## Error in train(fracinf ~ ., data = data_train, method = "knn", trControl = fitControl): could not find function "train"
print(fit3)## Error in print(fit3): object 'fit3' not found
So we find that some of these models do better than the null model and the single-predictor ones. KNN seems the best of those 3. Next, we want to see if pre-processing our data a bit more might lead to even better results.
Above, we fit outcome and predictors without doing anything to them. Let’s see if some further processing improves the performance of our multi-predictor models.
First, we look at near-zero variance predictors. Those are predictors that have very little variation. For instance, for a categorical predictor, if 99% of the values are a single category, it is likely not a useful predictor. A similar idea holds for continuous predictors. If they have very little spread, they might likely not contribute much ‘signal’ to our fitting and instead mainly contain noise. Some models, such as trees, which we’ll cover soon, can ignore useless predictors and just remove them. Other models, e.g., linear models, are generally performing better if we remove such useless predictors.
Note that in general, one should apply all these processing steps to the training data only. Otherwise, you would use information from the test set to decide on data manipulations for all data (called data leakage). It is a bit hard to say when to make the train/test split. Above, we did a good bit of cleaning on the full dataset before we split. One could argue that one should split right at the start, then do the cleaning. However, this doesn’t work for certain procedures (e.g., removing observations with NA).
nearZeroVar(data_train, saveMetrics = TRUE)## Error in nearZeroVar(data_train, saveMetrics = TRUE): could not find function "nearZeroVar"
#write code using the caret function `nearZeroVar` to look at potential uninformative predictors. Set saveMetrics to TRUE. Look at the results You’ll see that several variables are flagged as having near-zero variance. Look for instance at Deaths, you’ll see that almost all outbreaks have zero deaths. It is a judgment call if we should remove all those flagged as near-zero-variance or not. For this exercise, we will.
near_zero <- nearZeroVar(data_train)## Error in nearZeroVar(data_train): could not find function "nearZeroVar"
train_drop_near_zero <- data_train[, -near_zero]## Error in eval(expr, envir, enclos): object 'data_train' not found
dim(train_drop_near_zero)## Error in eval(expr, envir, enclos): object 'train_drop_near_zero' not found
#write code that removes all variables with near zero variance from the data You should be left with 13 variables (including the outcome).
Next, we noticed during our exploratory analysis that it might be useful to center and scale predictors. So let’s do that now. With caret, one can do that by providing the preProc setting inside the train function. Set it to center and scale the data, then run the 3 models from above again.
set.seed(1111) #Preprocessed lm fit
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
pfit1 <- train(fracinf~ ., data = train_drop_near_zero, method = "lm", preProc = c("center", "scale"), trControl = fitControl) ## Error in train(fracinf ~ ., data = train_drop_near_zero, method = "lm", : could not find function "train"
print(pfit1)## Error in print(pfit1): object 'pfit1' not found
#write code that repeats the multi-predictor fits from above, but this time applies centering and scaling of variables.
#look at the RMSE for the new fits
#Lots of variables with zero variance set.seed(1111) #Preprocessed earth fit
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
pfit2 <- train(fracinf~ ., data = train_drop_near_zero, method = "earth", preProc = c("center", "scale"), trControl = fitControl) ## Error in train(fracinf ~ ., data = train_drop_near_zero, method = "earth", : could not find function "train"
print(pfit2)## Error in print(pfit2): object 'pfit2' not found
set.seed(1111) #Preprocessed knn fit
fitControl <- trainControl(method="repeatedcv",number=5,repeats=5)## Error in trainControl(method = "repeatedcv", number = 5, repeats = 5): could not find function "trainControl"
pfit3 <- train(fracinf~ ., data = train_drop_near_zero, method = "knn", preProc = c("center", "scale"), trControl = fitControl) ## Error in train(fracinf ~ ., data = train_drop_near_zero, method = "knn", : could not find function "train"
print(pfit3)## Error in print(pfit3): object 'pfit3' not found
So it looks like the linear mode got a bit better, KNN actually got worse, and MARS didn’t change much. Since for KNN, “the data is the model”, removing some predictors might have had a detrimental impact. Though to say something more useful, I would want to look much closer into what’s going on and if these pre-processing steps are useful or not. For this exercise, let’s move on.
We can look at the uncertainty in model performance, e.g., the RMSE. Let’s look at it for the models fit to the un-processed data.
resamples_nonprocessed <- resamples(list(fit1, fit2, fit3))## Error in resamples(list(fit1, fit2, fit3)): could not find function "resamples"
resamples_nonprocessed %>% ggplot()## Error in eval(lhs, parent, parent): object 'resamples_nonprocessed' not found
#Use the `resamples` function in caret to extract uncertainty from the 3 models fit to the data that doesn't have predictor pre-processing, then plot itIt seems that the model uncertainty for the outcome is fairly narrow for all models. We can (and in a real setting should) do further explorations to decide which model to choose. This is based part on what the model results are, and part on what we want. If we want a very simple, interpretable model, we’d likely use the linear model. If we want a model that has better performance, we might use MARS or - with the un-processed dataset - KNN.
For this exercise, let’s just pick one model. We’ll go with the best performing one, namely KNN (fit to non-pre-processed data). Let’s take a look at the residual plot.
#Write code to get model predictions for the outcome on the training data, and plot it as function of actual outcome values.
data_train$knn_predict <- predict(fit3)## Error in predict(fit3): object 'fit3' not found
ggplot(data_train, aes(x = data_train$knn_predict, y = data_train$fracinf)) + geom_point() +xlab("Predicted") + ylab("Actual")## Error in ggplot(data_train, aes(x = data_train$knn_predict, y = data_train$fracinf)): object 'data_train' not found
#Not bad!
#also compute residuals (the difference between prediction and actual outcome) and plot that
data_train$knn_residuals <- residuals(fit3)## Error in residuals(fit3): object 'fit3' not found
ggplot(data_train, aes(x = data_train$knn_predict, y = data_train$knn_residuals)) + geom_point() + xlab("Predicted") + ylab("Residuals")## Error in ggplot(data_train, aes(x = data_train$knn_predict, y = data_train$knn_residuals)): object 'data_train' not found
Both plots look ok, predicted vs. outcome is along the 45-degree line, and the residual plot shows no major pattern. Of course, for a real analysis, we would again want to dig a bit deeper. But we’ll leave it at this for now.
Let’s do a final check, evaluate the performance of our final model on the test set.
data_test$knn_predict <- predict(fit3, data_test)## Error in predict(fit3, data_test): object 'fit3' not found
ggplot(data_test, aes(x = data_test$knn_predict, y = data_test$fracinf)) + geom_point() +xlab("Predicted") + ylab("Actual")## Error in ggplot(data_test, aes(x = data_test$knn_predict, y = data_test$fracinf)): object 'data_test' not found
test_residuals <- residuals(fit3, data_test)## Error in residuals(fit3, data_test): object 'fit3' not found
RMSE(data_test$fracinf, data_test$knn_predict) #RMSE = 11.35577## Error in RMSE(data_test$fracinf, data_test$knn_predict): could not find function "RMSE"
sum(test_residuals^2) #SSR = 24883.19## Error in eval(expr, envir, enclos): object 'test_residuals' not found
#Write code that computes model predictions and for test data, then compute SSR and RMSE.Since we have a different number of observations, the result isn’t expected to be quite the same as for the training data (despite dividing by sample size to account for that). But it’s fairly close, and surprisingly not actually worse. So the KNN model seems to be reasonable at predicting. Now if its performance is ‘good enough’ is a scientific question.
We will leave it at this, for now, we will likely (re)visit some other topics soon as we perform more such analysis exercises in upcoming weeks. But you are welcome to keep exploring this dataset and try some of the other bits and pieces we covered.